Modelling frequency, attestation, and corpus-based information with OntoLex-FrAC
OntoLex-Lemon has become a de facto standard for lexical resources in the web of data. This paper provides the first overall description of the emerging OntoLex module for Frequency, Attestations, and Corpus-Based Information (OntoLex-FrAC) that is intended to complement OntoLex-Lemon with the necessary vocabulary to represent major types of information found in or automatically derived from corpora, for applications in both language technology and the language sciences.
The Forgotten Document-Oriented Database Management Systems: An Overview and Benchmark of Native XML DODBMSes in Comparison with JSON DODBMSes
In the current context of Big Data, a multitude of new NoSQL solutions for
storing, managing, and extracting information and patterns from semi-structured
data have been proposed and implemented. These solutions were developed to
relieve the issue of rigid data structures present in relational databases, by
introducing semi-structured and flexible schema design. As current data
generated by different sources and devices, especially from IoT sensors and
actuators, use either XML or JSON format, depending on the application,
database technologies that store and query semi-structured data in XML format
are needed. Thus, Native XML Databases, which were initially designed to
manipulate XML data using standardized querying languages, i.e., XQuery and
XPath, were rebranded as NoSQL Document-Oriented Database Systems. Currently,
the majority of these solutions have been replaced with the more modern
JSON-based Database Management Systems. However, we believe that XML-based solutions
can still deliver performance in executing complex queries on heterogeneous
collections. Unfortunately, current research lacks a clear comparison of the
scalability and performance for database technologies that store and query
documents in XML versus the more modern JSON format. Moreover, to the best of
our knowledge, there are no Big Data-compliant benchmarks for such database
technologies. In this paper, we present a comparison for selected
Document-Oriented Database Systems that either use the XML format to encode
documents, i.e., BaseX, eXist-db, and Sedna, or the JSON format, i.e., MongoDB,
CouchDB, and Couchbase. To underline the performance differences we also
propose a benchmark that uses a heterogeneous complex schema on a large DBLP
corpus. Comment: 28 pages, 6 figures, 7 tables
Modelling collocations in OntoLex-FrAC
Following presentations of frequency and attestations, and embeddings and distributional similarity, this paper introduces the third cornerstone of the emerging OntoLex module for Frequency, Attestation and Corpus-based Information, OntoLex-FrAC. We provide an RDF vocabulary for collocations, established as a consensus over contributions from five different institutions and numerous data sets, with the goal of eliciting feedback from reviewers, workshop audience and the scientific community in preparation for the final consolidation of the OntoLex-FrAC module, whose publication as a W3C community report is foreseen for the end of this year. The novel collocation component of OntoLex-FrAC is described in application to a lexicographic resource and corpus-based collocation scores available from the web. Finally, we demonstrate the capability and genericity of the model by showing how to retrieve and aggregate collocation information by means of SPARQL, and how to export it to a tabular format so that it can be easily processed in downstream applications.
Creating an OWL ontology for representing, linking, and querying SemAF discourse annotations
Discourse markers are linguistic cues that indicate how an utterance relates to the discourse context and what role it plays in conversation. Linguistic Linked Open Data (LLOD) are emerging technologies that provide a powerful instrument for representing and interpreting language phenomena on a web scale. The main objective of this paper is to demonstrate how LLOD technologies can be applied to represent and annotate a corpus composed of multiword discourse markers, and what the effects of this are. In particular, it is our aim to apply semantic web standards such as RDF and the Web Ontology Language (OWL) for publishing and integrating data. We present a novel scheme for discourse annotation that combines ISO standards describing discourse relations and dialogue acts, ISO DR-Core (ISO 24617-8) and ISO-Dialogue Acts (ISO 24617-2), in 9 languages (cf. Silvano and Damova 2022; Silvano, et al. 2022). We develop an OWL ontology to formalize that scheme, provide a newly annotated dataset, and link its RDF edition with the ontology. We then describe the conjoint querying of the ontology and the annotations by means of SPARQL, the standard query language for the web of data. The ultimate result is that we are able to perform queries over multiple, interlinked datasets with complex internal structure without the need for any specialized software. Instead, off-the-shelf technologies based on web standards are used, which can be ported effortlessly to different operating systems, databases, and programming languages. This is a first, but essential, step in developing novel, powerful, groundbreaking, and (at some point) accessible means for the corpus-based study of multilingual discourse, communication analysis, and attitude discovery.
Historiae, History of Socio-Cultural Transformation as Linguistic Data Science. A Humanities Use Case
The paper proposes an interdisciplinary approach including methods from disciplines such as history of
concepts, linguistics, natural language processing (NLP) and Semantic Web, to create a comparative
framework for detecting semantic change in multilingual historical corpora and generating diachronic
ontologies as linguistic linked open data (LLOD). Initiated as a use case (UC4.2.1) within the COST
Action Nexus Linguarum, European network for Web-centred linguistic data science, the study will
explore emerging trends in knowledge extraction, analysis and representation from linguistic data
science, and apply the devised methodology to datasets in the humanities to trace the evolution
of concepts from the domain of socio-cultural transformation. The paper will describe the main
elements of the methodological framework and preliminary planning of the intended workflow.
Tracing Semantic Change with Multilingual LLOD and Diachronic Word Embeddings
Purpose: The project will combine word embedding techniques and linguistic
linked open data (LLOD) with theoretical aspects from lexical semantics, the history of
concepts, and knowledge organization to trace the evolution of concepts in a collection
of multilingual diachronic corpora of seven extinct and extant languages (Latin, Ancient Greek, Hebrew, French, Old Lithuanian, Romanian, German). The outcome will
consist of a sample of diachronic ontologies to be published on the LLOD cloud. It will
also comprise reflections on the potential interconnections across different languages
that can be built through these knowledge structures.
Workflow Reversal and Data Wrangling in Multilingual Diachronic Analysis and Linguistic Linked Open Data Modelling
Peer reviewed. The article deals with data wrangling in a multilingual collection intended for diachronic analysis and linguistic linked open data modelling for tracing concept change over time. Two types of static word embeddings are used: word2vec (French and Hebrew data sets), and fastText (Latin and Lithuanian data sets). We model examples from these embeddings via the OntoLex-FrAC formalism. To address the challenge of heterogeneity, we use a minimalist workflow design allowing for both convergence and flexibility in attaining the project goals. CA18209 - European network for Web-centred linguistic data science (NexusLinguarum)
Towards a Conversational Web? A Benchmark for Analysing Semantic Change with Conversational Bots and Linked Open Data
Peer reviewed. The paper presents preliminary results from our experiments with large language models, linked data, and semantic change in multilingual diachronic contexts. It proposes the first steps towards a benchmark and aims at fostering discussion on the concept of conversational knowledge bots as emerging paradigms, and the use of linked open data in linguistic tasks. CA18209 - European network for Web-centred linguistic data science (NexusLinguarum)